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ABSTRACT 

This paper presents a query evaluation technique for posi- 
tive relational algebra queries with aggregates on a represen- 
tation system for probabilistic data based on the algebraic 
structures of semiring and semimodule. The core of our eval- 
uation technique is a procedure that compiles semimodule 
and semiring expressions into so-called decomposition trees, 
for which the computation of the probability distribution 
can be done in time linear in the product of the sizes of the 
probability distributions represented by its nodes. We give 
syntactic characterisations of tractable queries with aggre- 
gates by exploiting the connection between query tractabil- 
ity and polynomial-time decomposition trees. 

A prototype of the technique is incorporated in the prob- 
abilistic database engine SPROUT. We report on perfor- 
mance experiments with custom datasets and TPC-H data. 

1. INTRODUCTION 

This paper considers the evaluation problem for queries 
with aggregates on probabilistic databases. 

The utility of aggregation has been argued for at length. 
In particular, aggregates are crucial for OLAP and decision 
support systems. All 22 TPC-H queries involve aggregation. 

Probabilistic databases are useful to represent and query 
imprecise and uncertain data, such as data acquired through 
measurements, integrated from multiple sources, or produced 
by information extraction [21]. In this paper, we use a rep- 
resentation system for probabilistic data called pvc-tables. 
It is based on the algebraic structures of semiring and semi- 
module to support a mixed representation of aggregated val- 
ues and tuple annotations for different classes of annotations 
and aggregations [2]. The pvc-tables can represent any fi- 
nite probability distribution over relational databases. In 
addition, the results of queries with aggregates can be repre- 
sented as pvc-tables of polynomial size. This contrasts with 
main-stream representation systems such as pc-tables [21], 
which can require an exponential-size overhead [15]. 

The problem of query evaluation is #P-hard already for 
simple conjunctive queries [21]. Aggregates are a further 
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source of computational complexity: for example, already 
deciding whether there is a possible world in which the SUM 
of values of an attribute equals a given constant is NP-hard. 
Existing approaches to aggregates in probabilistic databases 
have considered restricted instances of the problem: they fo- 
cus on aggregates over one probabilistic table of restricted 
expressiveness [4, 20, 16], or rely on expected values and 
Monte-Carlo sampling [10, 12, 22]. Expected values can 
lead to unintuitive query answers, for instance when data 
values and their probabilities follow skewed and non-aligned 
distributions [19]. Abiteboul et al. investigate XML queries 
with aggregates on probabilistic data [1]. An algebra pro- 
posed by Koch represents annotations and data values as 
rings which enables efficient incremental view maintenance 
in the presence of aggregations [13]. 

Our approach considers the problem of exact probabil- 
ity computation for positive relational algebra queries with 
aggregates on pvc-tables. The core of our technique is a 
procedure that compiles arbitrary semimodule and semiring 
expressions over random variables into so-called decompo- 
sition trees, for which the computation of the probability 
distribution can be done in polynomial time in the size of 
the tree and of the distributions at its nodes. Decompo- 
sition trees are a knowledge compilation technique [5] that 
reflects structural decompositions of expressions into inde- 
pendent and mutually exclusive sub-expressions. Flavours 
of decomposition trees have been proposed as compilation 
target for propositional formulas that arise in the evaluation 
of relational algebra queries (without aggregates) on proba- 
bilistic c-tables [18]. It has been shown that more complex 
tasks, such as conditioning probabilistic databases on given 
constraints [14] and sensitivity analysis and explanation of 
query results [11], can benefit from decomposition trees. 

Example 1. Figure 1 shows six pvc-tables, amongst them 
the suppliers table S, the products tables Pi and P2, and 
the table PS pairing suppliers and products. They all have 
an annotation column $ to hold expressions in a semiring 
K generated by a set of independent random variables, with 
operations sum (+) and product (•), and neutral elements 
0k and Ik- Each valuation of the random variables into a 
semiring (e.g. integers or Booleans) canonically maps semi- 
ring expressions into that semiring by interpreting + and • 
as the corresponding operations of that semiring. Each such 
valuation defines a possible world of the database. 

Figure Id shows the result of the query Qi that asks for 
prices of products available in shops. The annotations of the 
result tuples are constructed as follows: The annotation of a 
join of two tuples is the product of their annotations, and the 
annotation obtained from projection or union is the sum of 
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Figure 1: A database containing relations (a) Supplier S, (c) Products Pi, P2, and (b) the many-to-many 
relation ProductSupplied PS, and the results of the positive query Qi and of the aggregate query Q^. 



the annotations of the participating tuples [7]. For instance, 
the tuple (M&S, 10} has the annotation xij/n (21+25), whose 
probability distribution can be computed as a function of 
probability distributions of the random variables x\,y\\,z\, 
and 25 [21]. 

Consider the query Q2 from Figure le that asks for shops 
in which the maximal price for the products in Pi or P 2 is 
less than 50. Aggregation is expressed using the w operator, 
which in this query groups by the column shop and applies 
the aggregation MAX on price within each group. 

The annotations of result tuples are built using semi- 
module expressions of the form *]> <g) v, where ^ is a semi- 
ring expression and 11 is a data value. Such expressions 
can be "summed up" with respect to aggregation opera- 
tions: For MIN, the sum a + min /3 is min(a,/3); for MAX, 
a+max/3 = max(a, /?); for SUM, a+ sum ,S = a + /3. The sums 
correspond to operations in commutative monoids. The an- 
notation $ of M&S in Q2's result is constructed as follows. 
This tuple represents a group of six tuples in the result of 
Qi, all with the M&S shop value. The annotation $ then 
expresses the conditions (1) that the sum of the price val- 
ues of these six tuples in the MAX monoid is less than 50, 
and (2) that the group is not empty (as expressed by ^1). 
Depending on the valuation of the variables in $, these con- 
ditions can be true (T) or false (_L), or, more generally, the 
additive or multiplicative neutral element of the semiring. 

For instance, a valuation vi that maps £1,22, j/n, J/21, 21, 
22, 25 to T and all other variables to _L satisfies $, since 

vi ($) = [T ® 10 + max ± ® 50 + max T ® 11 + max 

± ® 60 + max _L ® 60 + max _L <gi 15 < 50] • T 

= [10 + max 11 < 50] = [max(10, 11) < 50] = T. □ 

If the variables in such expressions are random variables, 
then the expressions themselves can be interpreted as ran- 
dom variables. Moreover, the probability distributions of 
the obtained expressions reflect the probabilities of query 
answers taking particular values in a randomly drawn world 
of the database. Our technique allows to efficiently com- 
pute probabilities defined by such expressions by structural 
decomposition. For example, an expression a = ab ® 10 + 
xy ® 20 can be decomposed in independent sub-expressions 
ab Cg> 10 and xy <g> 20 that do not share variable symbols and 
hence constitute independent random variables. 

The structure of the paper follows the list of contributions: 

• We present an evaluation framework for queries with 
aggregates (SUM, PROD, COUNT, MIN, MAX) on 
pvc-tables, a representation system for probabilistic 
data based on semirings and semimodules. 



• We devise a technique for computing the exact proba- 
bility distribution of query results based on a generic 
compilation procedure of arbitrary semimodule and 
semiring expressions into so-called decomposition trees, 
for which the computation of the probability distribu- 
tion can be done in time linear in the product of the 
sizes of the distributions represented by its nodes. 

• We give a syntactic characterisation of a class of aggre- 
gate queries that are tractable on tuple-independent 
databases. Our query tractability result follows from 
the observation that the semiring and semimodule ex- 
pressions in the result of our tractable queries admit 
polynomial size decomposition trees and polynomial 
size probability distributions at their nodes. 

• A prototype of our technique is incorporated into the 
probabilistic database engine SPROUT. 

• Extensive performance experiments using our own syn- 
thetic datasets and TPC-H data are discussed. 

Besides exact computation, decomposition trees also al- 
low for approximate probability computation [18]. Due to 
lack of space, we refer the reader to the MSc thesis of the 
second author [9]. The pvc-tables can be extended to cope 
with continuous probability distributions, similar to the ex- 
tensions of pc-tables in the PIP system [12]. 

2. PRELIMINARIES 

2.1 Induced Discrete Probability Space 

Let S be a countable set and X be a finite set of S- valued 
independent random variables. We denote by P x the discrete 
probability distribution of a variable x G X, and by P x [s] 
the probability that x takes value s G S; we often specify 
P x by the set of pairs of unique values with their non-zero 
probabilities, {(s,Pc[s]) | s G S and P x [s] > 0}. The size of 
a probability distribution is the size of its set representation. 

Definition 1. Let Q = {v : X +» S} be the set of map- 
pings from X into S. A probability mass function 

Pr(iz) = J] P x [v(x)] 

for every sample v G fi, and a probability measure 
Pr(P) = Pr H f° r allECfl 

define a probability space (f2,2 n ,Pr) that we call the proba- 
bility space induced by X. 
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The probability distribution of the sum of two indepen- 
dent random variables is the convolution of their individual 
distributions [8]. For instance, given two random variables 
x, y over positive integers, the probability that the sum of 
the random variables equals to 4 is the sum of the probabil- 
ities of x being and y being 4, of x being 1 and y being 3, 
and so on. The applicability of convolution to determine the 
probability distribution of a function of independent random 
variables is not limited to sums of integers: 

Proposition 1. Given sets A, B , C , an A-valued random 
variable x with probability distribution P x , a B-valued ran- 
dom variable y with probability distribution P y , x indepen- 
dent of y, and an operation • : Ax B — > C ', the probability 
distribution P x . y of the C -valued random variable z — x • y 
is the convolution of P x and P y with respect to •: 

Px, y [c]= Px[a]P v [b] forallceC (1) 

(a,b)eAxB: 
c—a»b 

Example 2. The formula for the probability Pj. v * of the 
disjunction <E> V VP of two independent Boolean random vari- 
ables is a special case of Eq. (1): 

P*v*[T]= Yl P*[a]P*[b] 

(a,b)eBxB: 
T — aVb 

= P*[T]P*[T] + P*[_L]P*[T] + P*[T]P*[_L] 

= 1-(1-P*[T])(1-P*[T]) □ 

Remark 1. The sum in Eq. (1) is invariant under the re- 
striction to those a and 6 for which P x [a] > and P y [b] > 0. 
Hence, even if the cardinality of A x B is infinite, the convo- 
lution is a finite sum whenever only finitely many elements 
in A and B have non-zero probability. □ 

2.2 Monoids, Semirings and Semimodule 

Our representation system for probabilistic data is based 
on the notions of monoid, semiring, and semimodule. 

Definition 2. A monoid is a set M with an operation 
+ : M x M — > M and a neutral element € M that satisfy 
the following axioms for all mi, m 2 , mz € M: 

(mi + m 2 ) + m 3 = mi + (m 2 + m 3 ) 
+ mi = mi + = mi 

A monoid is commutative if mi + m 2 = m 2 + mi . 

Monoids naturally describe many aggregation operations. 
Aggregation over a column of a relation fixes a domain of 
values, usually K or N, and a binary operation. For instance, 
the MIN of a column with values vi , • • • ,«„6lis 

min(i>i, ■ • • , v„) = min(vi,min(v2, • • • min(i> n _i,v„) •••))• 

The binary operation is commutative and associative, i.e., 
the value of the aggregation is invariant under the order in 
which it is computed. Furthermore, each aggregation oper- 
ation has a neutral element, i.e., a value that does not con- 
tribute to the aggregation. For example, G K is the neutral 
element for SUM, and oo is the neutral element for MIN. 
These aggregations correspond to commutative monoids, in 
particular: SUM = (N, +, 0), MIN = (N ±0 °, min, +oo), MAX 
= (N ±0 °,max, -oo). COUNT is a special case of SUM. More 
complicated aggregations (e.g., AVG) can conceptually be 
composed from simpler ones (e.g., SUM and COUNT), but 
their treatment is out of the scope of this paper. 



Definition 3. A commutative semiring is a set S to- 
gether with operations +,■ : Sx5-> 5 and neutral elements 
0,1 € S such that (S, +, 0) and (S, ■, 1) are commutative 
monoids and the following holds for all si, s 2 , S3 € S: 

S 2 ' (s 2 + S3) = (Si ■ S 2 ) + (Si • S3) 
(si + s 2 ) • S3 = (si • S3) + (s 2 • S3) 
• si = si ■ = 0. 

Commutative semirings are the canonical algebraic struc- 
ture for tuple annotations [7] . Annotations from the Boolean 
semiring yield set semantics, annotations from N correspond 
to bag semantics, and annotations from the security semi- 
ring can be used to constrain access to query results depend- 
ing on access rights to database tuples that contributed to 
the result [2]. The most general semirings are those gener- 
ated over a set of variables. Intuitively, the carrier of such 
semirings are syntactic expressions built from the variables 
and the multiplication and sum symbols, where elements are 
identified via the semiring laws. For example, given a set 
X = {xi,X2,xs} of variables, the elements of the semiring 
PosBool(X) are positive Boolean expressions, e.g. xi + x 2 
or £1(2:2 +2:3). By the distributivity law in semirings, the 
expressions Xi(x2 + £3) and £i:r 2 + X1X3 are equal. A more 
general freely generated semiring is the ring of polynomials 
(also called free commutative algebra) over a set X of vari- 
ables. Elements of generated semirings are called semiring 
expressions which we denote by K throughout this paper. 

Definition 4. Let (S,+s,0s,-sAs) be a commutative 
semiring. An S-semimodule M consists of a commutative 
monoid (M, +m, 0m) and a binary operation ® : S x M — > 
M such that for all si,s 2 € S and mi,m 2 € M we have 

si (g) (mi +m m 2 ) — si (g) mi +m si (g> m 2 
(si +s s 2 ) ®mi = si ® mi +m s 2 ® mi 
(si -s s 2 ) ® mi = si (g> (s 2 ® mi) 
si ® 0m = 0.5 ® mi = 0m 
Is ® mi — mi. 

We write S*®M to denote a S-semimodule M, and write • 
for -s and + for +s,+m whenever the meaning is unam- 
biguous. Semimodules combine monoids with semirings to 
represent aggregation values conditioned on the value of a 
semiring expression. Analogous to the case of semiring ex- 
pressions that correspond to freely generated semirings, we 
denote by semimodule expressions the elements of the K- 
semimodule generated by a given monoid. The semimodules 
we use frequently are N®N and B®N for MIN, MAX, SUM, 
and PROD monoids. A semimodule B ® N over SUM would 
not have the intuitive semantics; this reflects the well-known 
incompatibility of SUM aggregation with set semantics. 

2.3 Query Language 

The query language under consideration is a restriction 
of positive relational algebra extended by an operator w for 
aggregation and grouping. Given a relation R over schema 
E, the operator vj in the query 

ro I ;ai ^AGGi(fli ),..., a, ^AGG i (B ! )(-R)i 

where (Au {Bi,...,Bi}) C E, groups by the attributes 
A and takes the aggregations AGGi to AGGi over the at- 
tributes Bi to Bi respectively. The result has schema A U 
{qi, . . . , ai}, where ai to q; are aggregation attributes. We 
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$ ::= x | $ + $ | $ ■ $ | [a 9 a] | [$ $] | s 

a ::= $®m {+ op $ (g> m} | m 

op ::= min | max | count | sum | prod 

0::= = | ^ | < | > | < | > 

x ::= A variable symbol x G X 

m ::= A value from an aggregation monoid M 

s ::= A value from the semiring S 

Figure 2: Grammar for semiring expressions $ 6 A" 
and semimodule expressions a G (if US)8 M. 

consider SUM, PROD, COUNT, MAX, and MIN aggrega- 
tions, and restrict the queries such that projection, union 
and grouping are never applied to aggregation attributes. 
This restriction simplifies query rewriting, but is not requi- 
site for our query evaluation method. Out of the 22 TPC-H 
queries, only query Q13 violates the restriction. 

Definition 5. The query language Q considered in this 
paper consists of queries that are built using the relational 
operators 5, o~, n, x , U, vj and satisfy the following constraints: 

1. Inn A (Q) anduj A . ai ^ AGGl(Blh , ^ ai ^ AGGl(Bl) (Q), the 
attributes in A are not aggregation attributes. 

2. In Q1UQ2, the attributes of Qi and Q2 are not aggre- 
gation attributes. 

Example 3. The TPC-H query Ql has the structure 
SELECT A,SUM(B) FROM R GROUP BY A which is equivalent 
to ro^^^suMfBjl-R)- TPC-H query Q2 has the structure 
SELECT A FROM R WHERE B = (SELECT MIN(C) FROM S),or, 
cquivalently, ~kao~b=~i 

The query R U ^A;/3<-sum(s) (S) is not in Q, since the 
second union term is a relation with the aggregation at- 
tribute /3 and hence violates constraint 2 in Def. 5. However, 
tta(R) U nAO- / 3> 5 (zu A . l3 ^ SVM(B) (S)) is a valid Q-query. □ 

3. PVC-TABLES: A REPRESENTATION 
SYSTEM FOR PROBABILISTIC DATA 

In this section we introduce a succinct and complete repre- 
sentation system for probabilistic data, which we call proba- 
bilistic value-conditioned tables or pvc-tables for short. This 
system is based on work in databases with provenance in- 
formation [2, 7, 3]. The reason for using pvc-tables, as op- 
posed to main-stream representation systems such as pc- 
tables (and special cases such as tuple-independent or BID 
tables) [21], is that pvc-tables fit naturally with aggregate 
queries: answers to aggregate queries on pvc-tables or even 
on pc-tables are representable as pvc-tables of size polyno- 
mial in the size of the input tables, while they may require 
an exponential overhead when represented as pc-tables [15]. 

Aggregation on pvc-tables is handled using a mixed repre- 
sentation of tuple annotations and aggregated values using 
the algebraic structures of semirings and semimodules. 

Definition 6. A pvc-tablc T over a probability space fi 
induced by a set X of variables is a relation with an anno- 
tation column $ holding semiring expressions over X, and 
where the tuple values can be constants or semimodule ex- 
pressions over X. A pvc-database D — {Ti, . . . , T n } is a set 
of pvc-tables over the same probability space Q. 
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Figure 3: Three possible worlds of pvc-table S in 
Figure 1 under two different semirings: Sm is anno- 
tated with expressions from the Boolean semiring, 
and and Sn,2 with integers. 

The semantics of D is given by its possible worlds: 

{{{u(t)\teT 1 },...,{v{t)\teT n }} \ ven). 

where the mapping v is applied to all expressions in tuples t 
and is identity for constants. Each mapping v G defines 
a possible world. 

Columns that hold semimodule expressions are called ag- 
gregation columns. Semimodule expressions over different 
monoids can co-exist in a pvc-table. The annotation column 
$ hosts semiring expressions that are generated by the vari- 
able set X, and conditional expressions representing com- 
parisons of expressions and constants. The other columns 
of a pvc-table host constants and semimodule expressions. 
A grammar for such expressions is given in Figure 2. All 
these types of expressions can occur in pvc-tables represent- 
ing the result of aggregate queries; we show how to compute 
these expressions for any Q-query in Section 4. 

Example 4. Figure 1 shows six pvc-tables. The pvc- 
table S has annotations from a semiring over the set X of 
variables xi, . . . ,x&. Figure 3 shows a few possible worlds 
for this pvc-table for different semirings. First, consider val- 
uations of the variables X into the Boolean semiring B. Pos- 
sible world Sb in Figure 3a has probability P X1 [_L] • P X2 [T] • 
P X3 [_L] • P Xi [_L] • P X5 [T] . Since B has only the two elements 
_L, T, the number of possible worlds is 2 5 . Secondly, con- 
sider valuations of X into N. Figure 3b shows two possible 
worlds, with respective probabilities P X1 [2] • P X2 [1] • P X3 [1] • 
Px 4 [7] ■ P* 5 [2] and P X1 [0] • P X2 [1] • P X3 [3] • P X4 [7] ■ P X5 [2] . □ 

The pvc-tables PS, Pi, P2, and Q\ in Figure 1 contain no 
semimodule expressions and are in fact pc-tables [21], a sim- 
pler representation system where tuple values are constants 
and annotations are expressions from semirings generated 
by a set of random variables. In pvc-tables, however, semi- 
module expressions can appear as tuple values. 

Example 5. Consider an aggregation AGG on the weight 
column of relation Pi in Figure 1. The semimodule expres- 
sion representing the aggregated value is a = z\ ® 4 +agg 
Z2®8 +agg z-a®7 +agg £4®6, where the +agg operator 
depends on the particular aggregation monoid AGG. □ 

We next look at the semantics of expressions under valu- 
ations of variables; this will be extended to a probabilistic 
interpretation in Section 5. 

Semiring, Monoid, and Semimodule Homomorphism. 

Let K be the semiring generated by X; the variables in X arc 
themselves elements of K, i.e. X C K. Given another semi- 
ring S, a mapping v : X — > S of the variables can be uniquely 
extended to a semiring homomorphism v : K — > S that 
"evaluates" semiring expressions from K to elements in S. 
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Database Semantics 


S 


Probability Distributions 
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Set 
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P x [T] = lorP x [_L] = l 
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3n € N : P x [n] = 1 


Probabilistic 


Set 
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p x [T],p x [±] e [0,1] 


Probabilistic 


Bag 


N 


Vn € N : PJn] G [0,1] 



Table 1: Database semantics for different semirings 
S and probability distributions. 

A similar construction holds for semimodule: For semimod- 
ule expressions W and a monoid M, a mapping v : X — > S 
is extended to a monoid homomorphism v : W — > M. 

Example 6. Consider the semimodule expression 

a = xy ® 5 + min (x + z) ® 10 

over the semiring generated by X = {x, y, z} and the mo- 
noid (N, + m in,oo). The map v : X — > N evaluates variable 
symbols to the semiring of positive integers and induces a 
monoid homomorphism that maps a to positive integers. 
For instance, the mapping v : x i-> 2, y M> 3, z <-¥ yields 

!^(a) = v(x)v(y) ® 5 + min (^(x) + ® 10 

= 6 ® 5 + min 2 ® 10 = 1 ® 5 + min . . . +min 1 ® 5+ m i 
1 ® 10 + min . . . +min 1 ® 10 = 5 + mi n 10 = 5. 

Similarly, the value of a in Example 5 depends on the par- 
ticular target semiring S and the valuation of the variables. 
For instance, a i-t 24 for SUM aggregation and semiring 
N with 21, z-2 H» 2 and 2:3,24 i-> 0. Also, a H» 6 for MIN 
aggregation and the Boolean semiring with zi n 1 and 
22, 23, 24 M> T. If all variables are mapped to 0s, the query 
answer is 0m, i-e. in case of SUM and +00 for MIN. □ 

Conditional Expressions have the form [a9f3], where a 
and ft are semimodule expressions for (possibly different) 
aggregation monoids and semirings generated by X, and 9 is 
a binary relation. Such conditional expressions can appear 
as tuple annotations in pvc-tables. Given a valuation v : 
X — > S and (semiring or semimodule) expressions $, \&, the 
semantics of [$#*] is defined by extending v via 



j/([$0*]) 



Is, if v{$) !/(*) 
0s, otherwise 



(2) 



Comparisons operators <,> for 9 only make sense for 
ordered carriers. In the light of Eq. (2), we can equivalently 
see [$#*] as a binary operation SxS->SorMxM->S. 

Example 7. The annotations in the pvc-table Q2 in Fig- 
ure 1 contain conditional expressions representing compar- 
isons between semimodule expressions and a constant from 
M and semiring expressions and the constant 0k- D 

Set vs. Bag Semantics. The pvc-tables system is generic 
enough to encode both deterministic and probabilistic data- 
bases with set and bag semantics. Table 1 summarises how 
different choices for the semiring S and probability distribu- 
tions for variables x € X give rise to those semantics. 

Under the Boolean semiring, annotation expressions are 
evaluated to _L or T, denoting tuples in or not in the data- 
base instance, respectively. If in addition every variable 
x G X has either P4-L] = 1 or P X [T] = 1, then we have the 
case of deterministic databases with set semantics, i.e. there 
is exactly one possible world with non-zero probability. 

For deterministic bag semantics, tuple annotations arc 
evaluated to N and represent tuple multiplicities and for 



each variable x G X there is exactly one n G N for which 
Px[n] = 1. For probabilistic databases with set semantics, 
every answer tuple has a probability for being true or false, 
while in the case of bag semantics, the pvc-table encodes a 
probability distribution over the multiplicity of its tuples. 

Existing representation systems for probabilistic data are 
over the Boolean semiring, i.e., their semantics is set-based. 
While this choice is justified in the absence of aggregation 
(or for MIN/MAX), it is not for COUNT/SUM/PROD ag- 
gregates, which require bag semantics. In the latter case, 
we need a semiring for which there is a homomorphism into 
N [2]. The set-based probabilistic database model is sub- 
sumed by the probabilistic bag-based model: Every x G X 
is an N-valued random variable, yet the probability distri- 
butions P x are non-zero only for 0, 1 G N. 

4. QUERY EVALUATION I: COMPUTING 
THE TUPLES IN THE QUERY RESULT 

Our approach to query evaluation of Q-queries on pvc- 
tables has two logical steps: In the first step we compute 
the tuples in the query result, and in the second step we 
compute their probability distributions. We discuss the first 
step in this section and the second step in the next section. 

The tuples in the result of a query Q can have semimodule 
expressions as values and semiring expressions as annota- 
tions. The construction of such expressions can be done by 
a query that can be statically inferred from Q [2]. Figure 4 
gives a translation [•] of Q queries into SQL queries that 
compute the tuples in the query results (we chose SQL here, 
but the translation target can also be Q). The rewriting ac- 
counts for joint and alternative use of data: Joint use of data 
(as in join) corresponds to multiplication in the annotation 
semiring and alternative use of data (as in union or projec- 
tion) corresponds to summation in the annotation semiring. 
Semimodule expressions are constructed for the aggregation 
and grouping operator vs. The translation of the opera- 
tors rename, selection, projection, product, and union is the 
same as for pc-tables [21]. In case of zu, the translation 
depends on the presence of group-by attributes. The trans- 
lation uses the custom SQL operators J^agG'Ek''^ an d 
® to construct annotation expressions. 

Example 8. The query TO 0;C( ^ AGG(wci g ht) (Pi) from Ex- 
ample 5 is rewritten into select 5^ AGG (R. < I > ® R. weight) 
as a, 1 k as $ from (select * from Pi) R. The answer 
to the rewritten query is one tuple of value a = 21 ®4 +agg 
22 ® 8 -f agg 23 ® 7 +agg 24 <8> 6. The annotation of this 
tuple is the constant Ik, stating that the tuple (a) is the 
query answer in all possible worlds. The evaluation of a 
may however yield different outcomes in different worlds. 

The query 7T0iT5< a (TO0 ;a< _MiN(weight)(-Pi)) is rewritten to 
select S.$-[5<S.a] as <3> from (select J2aiu{ ® 
R. weight) as a, Ik as <I> from (select * from Pi) R) 

5. This Boolean query asks for the probability of the min- 
imum weight of articles being larger than 5. The result 
of this query is a single empty tuple with annotation $ = 

ljf • [Zl ® 4 +min 2 2 ® 8 +min 2 3 ® 7 + m i n 24 ® 6 < 5] . □ 

As seen in the above example, for aggregation without 
grouping the resulting semimodule value is annotated with 
Ik, i-e., it "exists" in every possible world. In case of group- 
ing, the resulting tuples are additionally annotated with a 
conditional expression to enforce that the group is not empty 
and thus at least one tuple in the group has a annotation 
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[.R] = select R.*, R.$ from R 

[<Wa(Q)1 = select R.*, R.A as B, R.$ as <I> from ([Q]) R 

[cta0b(Q)] = select R.*, R.$ - K [ABB] as <I> from ([Q]) R 

[tAi,...,a„(Q)1 = select R.Ai, . . . , R.A„, ^2 K (R-&) as $ from ([Q]) R group by R.Ai, . . . , R.A n 

[Qi x Q 2 j = select R.*, S.*, R.$ - K S.<I> as <S> from ([QJ) R, ([Q 2 ]) S 

IQi UQ2I = select R.*, (_R.$) as <I> from(select * from ([Qi]) union all select * from ([Q2])) R group by R.* 

l^A 1 ,...,A n ; ai ^AGG 1 (B 1 ),...,a l ^AGG l (B l )(Q)l = select R.Ai, . . . , R.A n , Ti as a lt . . . , T ; as a t , 

[(J2 K R$ ) * ° K ] as * from (KD R g rou P ^ RAl > ■ ■ • ' RA ™ 
[ ro 0;ai^AGGi(S 1 ),...,a ! ^AGG ! (B;)( ( 3)]l = sclcct Tl as ai >' ■ ■> F ' as a ' ' ^ as * from (IQi) R 

fE A rr (R.*®R.Bi) if AGGi = MIN, MAX, SUM, PROD 
where Fi = < „ *, •, 

\ Esum ® 1) if AGGi = COUNT 



Figure 4: Recursive algorithm [•] for rewriting a Q query to account for computation of semiring (K) and 
semimodule expressions. We assume that R.*, S.* do not select column $. 



different from Ok in a possible world. This has been exem- 
plified in the introduction for the query Q2- However, this 
conditional expression is not always necessary: 

Example 9. Consider the query 

Q2 = TTshopfP^SO^shopiP^ MIN(pricc) [Ql] 

similar to Q2 on the database in Figure 1. Under the Boo- 
lean semiring K, the variables can take values _L = Ok 
and T = Ik- In a possible world with £1,2:2, £3 t-¥ _L, 
(M&S) is not an answer since there is no supplier for the 
shop M&S in the input instance S. Its annotation evaluates 
to [00 < 50] ■ _L = _L • _L = _L. Here, the conditional expres- 
sion *&i from Figure 1 would not be necessary, since it is 
implied by the first conditional expression in the expression 
$. In case of other monoids, such as MAX (as discussed in 
the introduction), SUM, or COUNT, *i is necessary. □ 

The constraints imposed on the query language Q by Def- 
inition 5 simplify the rewriting [•]: Since projections and 
unions are only done on non-aggregate columns, the rewrit- 
ing rules for those operators can assume the input tuples to 
be free of semimodule expressions. 

Closure of pvc-tables under Q queries. Since pvc-tables 
generalise pc-tables and the latter are complete in the sense 
that any finite probability distribution over relational data- 
bases can be represented using pc-tables [21], it follows that 
pvc-tables also form a complete representation system. In 
particular, they are closed under Q queries, in the sense that 
for any input pvc-database, the result of any Q query can 
be represented as a pvc-table. The pvc-tables representing 
query results are also succinct (for a fixed Q query), since 
the translation [•] only constructs SQL queries of size linear 
in the size of the input Q queries and the evaluation of such 
SQL queries is in polynomial time data complexity. 

Theorem 1 (Completeness and Succinctness). 

1. Any finite probability distribution over relational data- 
base instances can be represented by pvc-databases. 

2. Given a pvc-database V and a fixed Q query Q, the 
query result IQ] (T>) can be represented as a pvc-table 
with size polynomial in the size ofV. 



In particular, pvc-tables can be exponentially more succinct 
than pc-tables. The key difference is that pvc-tables allow 
for values and annotations to be intertwined in semimodule 
expressions, which can encode exponentially many possible 
outcomes of an aggregation in polynomial space. In pc- 
tables, we would need to enumerate all these outcomes [15]. 

5. QUERY EVALUATION II: PROBABILITY 
COMPUTATION BY DECOMPOSITION 

This section presents a novel approach to computing prob- 
ability distributions for query results on pvc-tables which is 
equivalent to computing probability distributions of the re- 
sults' semiring and semimodule expressions. Our approach 
is based on compiling expressions into a tractable form called 
decomposition trees, for which distributions can be com- 
puted efficiently. A key property of our approach is its gen- 
erality, as it applies to different semirings and semimodules. 
Expressions as Random Variables. We first show how 
semiring and semimodule expressions — including condi- 
tional expressions — in pvc-tables give rise to random vari- 
ables. Let S be a countable semiring and M a monoid, X 
a set of S- valued random variables, and fi the probability 
space induced by X; if and K Cg> M are the sets of semiring 
and semimodule expressions over X. A semiring expression 
<I> G K can be seen as a random variable $ : fi — > S by 
letting $ : v h-> and with probability distribution 

P$[s] = Pr({^ G fi | v($)=s}) = Pr W- ( 3 ) 

i/(3>)— s 

A semimodule expression a G K($M is a random variable 
a : fi — > M with a : v v- > v(a) and probability distribution 
P a defined similar to Eq. (3) for all me M. If M is over R, 
then a is an E-valued random variable. However, since the 
random variables X and hence the probability space fi are 
countable, the set of values that a may take is countable. 
Probability Distributions of Independent Expressions 
Two semiring or semimodule expressions are (syntactically) 
independent if their sets of variables are disjoint; indepen- 
dent expressions are independent random variables. 

Example 10. Let X = {x, y, a, b, c}, M = N, <3> = x + y 
and a = a(b + c)® 10 + c<g> 20. Then $ and a are independent 
random variables since their sets of variables are disjoint. □ 
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i ® 10 



b 1 



* \ 

b 2 



Figure 5: A d-tree for a — a(b + c) 10 + c 20 over 
the semimodule N N. The node [J has one child 
for every fc £ N with non-zero probability for c ^— fc, 
i.e. two children for c = 1,2 in the setting of Ex.12. 
The thick (blue) part is a d-tree for the semiring 
component a(b + c) + c, where is replaced by 0. 

The probability distribution of the sum or product of inde- 
pendent expressions is given by their convolution (cf. Propo- 
sition 1) with respect to the addition and multiplication op- 
eration of the semiring and semimodule. Given independent 
expressions $, ^ £ K and a, /3 € K M with probability 
distributions P$, P*, P a ,Pp, we have for all s £ S, m £ M: 



P<S> + 1£ [s 

P$.* [s 

P Q+f3 [m 
P*® Q [m 
P[ a e/3] [s 
P[*e*] [s 



p*N-p*[ S2 ] 

Si,S2G<S:si+S2=s 

P*[8l] ■ &Sr[82] 

si,s 2 €S:si-S2=s 
m l i m 2 €.M :mi + m2 — m 

P*[ai] • P a [mi] 
^ Pa[mi]-P^[m 2 ] (8) 

53 P»[fli]-P*[«a] 

si£S,S2€:S:[si6s2]—s 



(4) 
(5) 
(6) 
(7) 



(9) 



Following Remark 1, the sums in the above equations are 
restricted to summands of non-zero probabilities. 

Example 11. Let S and M be the semiring and the mo- 
noid of natural numbers with standard addition and multi- 
plication. Let $ = x with P x = {(0, 0.3), (1, 0.3), (2, 0.4)}, 
and q = y 5 be a semimodule expression with P y — 
{(1,0.4), (2,0.4), (3,0.2)}. Then a is a M-valued random 
variable with P a = {(5,0.4), (10,0.4), (15,0.2)}. The prob- 
ability distribution of <E> 01 = x (y 5) = (x -s y) 5 
is given by Eq.(7). For example, P*®a[10] = P^P^IO] + 
P:c[2]Pa[5]. The convolution is finite as the probability dis- 
tributions are non-zero for finitely many elements. Further 
possible outcomes for $ a are 0, 5, 15, 20, 30. In case of 
the Boolean semiring, S=B, possible outcomes are and 5 
with P<s>® a [5] = P x [T]P y [T] and P ® a [0] = 1 - P* 8q [5] . □ 

Partitioning into Mutually Exclusive Expressions. 

Given an expression $ and a variable a; £ X that occurs in 
$, the probability distribution of $ can be expressed using 
probability distributions of sub-expressions under valuations 
of x. For any s' £ S, we denote by ^\ x ^ 3 / the expression 
obtained from $ by replacing every occurrence of x by s' . 
Then the probability distribution of $ can be partitioned by 
the probability distributions of expressions <&\ X< - B >: 




P*[ 



£ft 

s'sS 



[«'] ■ p> 



(10) 



Figure 6: D-tree for the semimodule expression 

£42/41 (ZI+Z5)® 15+ m ax £41/4323® 60+ m axa;5j/5l(zi+Z5)® 10 

(part of the annotation of tuple (Gap) in Figure le) 
over the semimodule B0N. The thick (blue) part is a 
d-tree for the semiring component of the expression, 
where is replaced by 0. 

Decomposition Trees (d-trees) are a normal form for 
semiring and semimodule expressions. We next define them 
and show how to compile arbitrary expressions into d-trees. 

Definition 7. Let M be a monoid and K be a semiring 
generated by an S-valued set X of variables and constants 
from S. A decomposition tree, or d-tree, is a tree, where 
each inner node is one of ©, 0, 0, |J, or [0] and each leaf 
node is a variable in X or a constant in S, M , or S M . 
The five types of inner nodes have the following meaning. 

1. A node © with children representing $ and ^ repre- 
sents the expression <3> + ^ , where $ and $ are independent 
expressions in K , or K M respectively. 

2. A node with children representing $ and ^ repre- 
sents the expression $ • 9, where $ and ^ are independent 
semiring expressions in K . 

3. A node with children representing $ and a repre- 
sents the expression $ a, where $ and a are independent 
expressions in K and K M, respectively. 

4- A node [0] with children representing $ and ^ repre- 
sents the expression where <3> and are independent 
expressions in K or K M . 

5. Given a variable x £ X, an inner node \_\ x with children 
representing &\ X ^ S1 , . . . , <&\ x ^ Sn for all those Sj £ S with 
Px[si] 7^ represents the expression $. 

Just like semiring or semimodule expressions, d-trees rep- 
resent probability distributions. Direct implementations of 
Eq. (4) through (10) are efficient procedures to compute the 
probability distribution at any inner node of a d-tree, given 
the probability distributions at its children: Eq. (4) and (6) 
apply to nodes, Eq. (5) applies to nodes, Eq. (7) applies 
to nodes, Eqs. (8), (9) apply to [0], and Eq. (10) applies 
to [J^, nodes. For leaves, we only have the trivial cases of 
a variable x with probability distribution P x , of constant 
s £ S or m £ M with distribution {(s, 1)} and {(m, 1)}, 
and of constant semimodule expressions of the form s m 
with probability distribution {(s m, 1)}. The probability 
distribution of the entire d-tree is the distribution of its root 
and can be computed bottom-up in one pass over the d-tree. 

Example 12. Consider the d-tree from Figure 5 in which 
it is assumed that each variable x £ {a, b, c} has non-zero 
probabilities p x and p x = 1 — p x for values 1 and 2, re- 
spectively. We first compute the probability distribution 
for the left branch of the d-tree under the aggregation mo- 
noid SUM. The distributions are: {(l,pt), (2,pb)} for b, 
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Compile (Expression 3>) 
begin 

if $ has no variables then 

L return $ 
if 3 independent $1, <E> 2 s.t. $1 + $2 = $ then 

[_ return Compile($i) © Compile($ 2 ) 

if 3 independent 3>i,<E> 2 s.t. $1 • $ 2 = $ then 
L_ return Compile($i) Compile(<3? 2 ) 

if 3 independent $1, <E> 2 s.t. $1 $ 2 = <E> then 
[_ return Compile($i) Compile($ 2 ) 

if 3 independent $i,<I>2 s.t. [$i6'$ 2 ] = $ then 
[_ return Compile($i) [6*] Compile(«J> 2 ) 
Choose variable x G X occurring in $ 
_ return \J x (Vs G S,P x [s] ^ 0: Compile($ s )) 



{(1,1)} for 1, {(2,p(,),(3,p 6 )} for 6©1, {(10,1)} for 10 10, 
{(20,p 6 ),(30,p 6 )} for (6 1) 10, {(l,p a ), (2,p a )} for a, 
{(20,p aPi ,), (30,p a p 6 ), (40,p a p 6 ), (60, p a p 6 )} for a(6+l)0lO, 
{(20, 1)} for 1020, {(40,p a p 6 ), (50,p a p„), (60,p a p 6 ), (80, p a p t )} 
for 0(6 + 1) 10 + 1 20, and finally 

{ (40, PaPbPc), (50, p a p b p c ), (60, p a p b p c ), (80, PaPbPc)} 

for the entire left branch. For the right branch one obtains 

{(70, PaPbPc), (80, PaPbPc), (100, PaPbPc), (120 , papbpc)} ■ 

The probability distribution of the entire d-tree is then: 

{(40, PaPbPc), (50, PaPbPc), (60,paPbPc), (70, PaPbPc) , 
(80,p a pbPc + PaPbPc), (100, paPbpc), (120 , papbpc)} 

In case of MIN aggregation, the distribution for both bran- 
ches as well as for the entire d-tree is {(10, 1)}. In case of 
the Boolean semiring and MIN aggregation, the distribution 
for the left branch (c <— _L) is {(10, p a p&p c ), (<x>,p a pbpc + 
paPbpc+papbpc)} , for the right branch (c T) is {(10, p a p c ), 
(20,p a p c )}, and for the overall d-tree {(10, PaPbpc +p a p c ), 

(20,paPc), (00,PaPbpc+PaPbpc+ PaPbPc)}- □ 

We now turn to analysing the complexity of computing 
the probability distribution for a given d-tree. Since d-trees 
are binary, the distribution of a convolution node can be 
computed in time linear in each of the sizes of the distribu- 
tions of the children according to Eq.(l). The distribution 
of a |J-node is computed in time linear in the number of its 
children and their probability distributions. This implies: 

Theorem 2. The probability distribution P a of a d-tree d 
whose nodes have probability distributions pi , . . . , p n can be 
computed in time 0(J\ |P*D- 

For the SUM monoid, the size of Pd can be exponential in 
the number n of leaves, since there may be exponentially 
many distinct sums out of n numbers. We analyse two 
classes of d-trees for which we can obtain a better upper 
bound on the time complexity of computing Pd- 

Firstly, observe that the sum of two elements in the MIN 
or MAX monoid is one of the elements itself (e.g. a + m i n 6 
is either a or 6). This implies that the size of the proba- 
bility distribution of a semimodule expression a is bounded 
by the number of monoid values occurring at the leaves of 
d. Moreover, if a's variables are N-valued, a's probabil- 
ity distribution is unaltered under the following reduction 
of its variables to B- valued variables: P x [-L] = P x [0] and 
P x [T] = 1 — P x [_L] . Hence, one may equivalently consider a 
d-tree for B-valued variables obtained from this reduction: 

Proposition 2. Given a semimodule expression a over 
MIN or MAX, the distribution P a can be computed in time 
linear in the size of a 's d-tree in which all variables are 
considered by their reduction to M-valued variables. 

Secondly, we analyse SUM aggregation. Evaluating a 
SUM expression ^ i, 8 mi in which all independent 
variables is hard [19]. Yet, instances in which the values 
mi are constrained are tractable: we say that a semimodule 
expression fa 0fii over the SUM monoid N is m-bounded 
if there is a constant m G N such that Vi : < trij < m. Ag- 
gregation over rationals from a bounded set can be regarded 
as bounded if the number of decimal places is fixed, e.g. for 
currencies with 2 decimal places. 



Algorithm 1: Compilation of semimodule or semi- 
ring expressions into d-trees. 

Proposition 3. Let a = ^™=i 0» ® m * ^ e an m-bounded 
semimodule expression over the SUM monoid N where each 
fa is a product of variables and all variables have non-zero 
probability only for 0s and Is- The probability distribution 
Pd of a d-tree d for a can be computed m time 0(n 2 m 2 d). 

Note that every fa may only evaluate to 0s or Is and hence 
the sum is bounded by the product of the number of terms, 
n and m; the product ti-mis thus an upper bound for the 
size of the probability distribution at each node. In partic- 
ular, COUNT aggregation corresponds to mi = 1 for all i. 
Then the resulting probability distribution has at most size 
n and can be computed in time 0(n 2 d). In a d-tree that 
contains semimodule expressions a, (3 over different aggre- 
gation monoids, the complexity of the sub-trees of a and /3 
is each according to Propositions 2 and 3. 
Compiling Expressions into d-trees. 
Algorithm 1 sketches the construction of a d-tree for an in- 
put expression $ by repeatedly applying six decomposition 
rules. The obtained d-tree is equivalent to the input expres- 
sion, i.e., they have the same probability distribution. 

The first four rules check whether the input expression 
can be partitioned and decomposed into two independent 
expressions $i and <3> 2 . In the first rule, $i and <I> 2 are 
either both semimodule or both semiring expressions; in the 
second rule, they are semiring expressions. In the third rule, 
<I>i is a semiring, and $ 2 is a semimodule expression. 

In order to find independent decompositions in polynomial 
time, expressions are analysed at a syntactic level: We at- 
tempt the first rule only if $ is a sum and partition it by the 
connected components in its clause-dependency graph. In 
addition to such syntactic manipulations, the decomposition 
in the second and third rules uses known polynomial-time 
algorithms to recognise read-once expressions, i.e., expres- 
sions where each variable occurs once, and hence factorisc 
expressions based on algebraic rewritings such as the asso- 
ciativity and commutativity laws, e.g., [6, 18]. In particular, 
this approach allows to factor expressions into complex sub- 
expressions and not only into one variable and the residual. 

The last rule decomposes $ into sub-expressions <l? s , for 
each s 6 S. As explained for Eq. (10), each expression <3i s 
is obtained in linear time by replacing x with the constant 
s in $. Many heuristics have been proposed for choosing 
the variable x, since good choices can make the difference 
between polynomial and exponential size decision diagrams 
(such as ordered binary decision diagrams, d-DNNFs, or d- 
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trees). In our implementation, we choose a variable with 
most occurrences [18]. 

Remark 2. The requirement that the underlying algebraic 
structures be commutative and associative is crucial for struc- 
tural decomposition: without these properties, expression de- 
composition would be constrained to the fixed order defined 
by the order of symbols and the nesting of the expression. 

Example 13. Figure 5 depicts a d-tree for the semimod- 
ule expression $ = a(b + c) ® 10 + c ® 20. $ cannot be split 
into independent sub-expressions since variable c occurs in 
both summands. We choose to eliminate variable c to cre- 
ate mutually exclusive events. (This is a good choice since 
it leads to independent sub-expressions.) We thus create a 
node [J with as many children as valuations of c that have 
non-zero probability. Consider the case c <— 1 and hence 
$| c< _i = a(b + 1) ® 10 + 1 ® 20. This corresponds to the 
left branch in the d-tree, and can be decomposed using the 
first rule into its independent summands a(b + 1) ® 10 and 
1 ® 20. The former expression can be decomposed using the 
third rule into independent expressions a and (6 + 1) ® 10. 
The procedure continues until we completely decompose the 
expressions into variables and semimodule constants. 

Figure 6 shows a d-tree for the semimodule expression in 
the first conditional expression in the annotation <E> of the 
result tuple (Gap) in Figure le over the semimodule B ® N: 

£4j/4l(2l + 2 5 ) ® 15 + £4y43Z3 ® 60 + £57/51 (zi + 2 5 ) ® 10 

It cannot be partitioned into independent sub-expressions. 
Choosing variable X4 in the last rule, we create a node |J . 
Its left child $|^ 4 ^± = £53/51(21 + 25) ® 10 can be decom- 
posed using the third rule into £52/51(21+25) and 1®10. The 
former is a read-once expression, on which the second rule 
can be applied twice to decompose it into £5 and the rest, 
then the rest into 2/51 and 21+25. The latter is decomposed 
using the first rule into 21 and 25. The right child 

*U 4 <-T = 2/41(21 + 25) ® 15 + 2/4323 ® 60 + £52/51(21 + 25) ® 10 

can be decomposed using a sequence of the first three rules 
that exploit the independence of sub-expressions. 

By construction of query results using [•], the semiring 
expression in the second conditional expression in $ is pre- 
cisely the semiring part of the above semimodule expression: 

£42/41(21 + 25) + £42/4323 + £52/51(21 + 25), 

We can thus use the very same compilation steps as above, 
with the simplification that only the first two rules, which 
work on semirings, need be applied. The d-tree is the same 
as for the semimodule expression, but where ® nodes are re- 
placed by and without semimodule expressions at leaves. 
Figure 6 shows this d-tree with thick (blue) edges. □ 

Proposition 4. For any semimodule or semiring expres- 
sion Algorithm 1 constructs a d-tree d such that Pg> = P d . 

By virtue of this equivalence, we can compute the probabil- 
ity distribution of any expression by first compiling it into 
a d-tree d and then computing its probability distribution 
Pd- The algorithm is complete since the last rule is always 
applicable until all variables are evaluated. However, ex- 
clusive application of this rule is not desirable: compiling 
arbitrary expressions by applying the last rule only can lead 
to d-trees of size exponential in the number of variables. 
This limitation seems necessary: if we could compile any 



expression $ into a d-tree in polynomial time, then hard 
problems such as satisfiability, tautology, and even proba- 
bility computation can be solved in polynomial time for <3>. 
In practice, however, this rule can be quite effective, as we 
show in the experiments. In case we need only apply the 
first four rules, the compilation finishes in polynomial time. 
This is already known for semiring expressions occurring in 
the results of tractable relational algebra queries without 
repeating symbols [18]. In Section 6, we define a fragment 
of our query language Q consisting of tractable queries that 
only create expressions compilable using the first four rules. 
We finish this section with two observations about our com- 
pilation approach, which cannot be presented at length due 
to space constraints. 

Pruning Conditional Expressions. The evaluation of 
[a#/3]-expressions can be considerably improved by employ- 
ing pruning rules that prove parts of a or j3 redundant. Con- 
sider the expression $ = [aO/3] = [x ® 10 + m i n y ® 20 < 15] 
which can be decomposed into sub-expressions a and j3. The 
probability P*[ls] = 1 — Pr[0s] is independent of y; indeed, 
when computing P a one may safely ignore computing P a [20] 
since it cannot contribute to P$[lg] in the convolution of [6]. 
Equivalently, we may replace $ with a simpler yet equivalent 
expression in which redundant terms are pruned. Examples 
of pruning rules are (similar and symmetric cases omitted): 



MIN 



SUM 



) mi < m 



) rrii < m 



= u 



rrii < m 



Pruning is particularly effective when the probability distri- 
butions of a and /3 have exponential size, such as in case 
of the SUM monoid; there, early pruning can avoid the full 
materialisation of such probability distributions. 
Compiling Joint Probability Distributions. A tuple 
in the result of an aggregate query in Q may have several 
semimodule expressions and be annotated with a conditional 
expression; in such cases we are interested in finding the 
joint probability distribution of these expressions. This can 
be accomplished by compiling the expressions into a sin- 
gle d-tree by applying mutex decomposition until some of 
the expressions become independent; the joint probability 
distribution of two independent random variables is simply 
their product. For instance, given integer variables a, b, c 
with non-zero probabilities for 1,2 only, the mutex decom- 
position on a decomposes the joint expression (a + b,a ■ c) 
into two branches (1 + 6, 1-c) and (2 + 6, 2- c). In each branch, 
the two expressions are independent and can be considered 
separately. For example, the probability for the value (3, 2} 
can be worked out to be P a [2]Pf,[l]P c [l] + P a [l]P b [2}P c [2}. 

6. TRACTABLE QUERY EVALUATION 

This section describes the classes Q md and Q hle of tract- 
able relational algebra queries with aggregation. The char- 
acterisation uses as building blocks queries that return tuple- 
independent relations and queries reminiscent of hierarchi- 
cal queries [21]. Class Q hl ° uses Q md as a building block, 
i.e. Q ind C Q hic . Our tractability result follows from the 
discussion in this section: 

Theorem 3 (Tractable Queries). 
Every query in Q hle has polynomial-time data complexity. 
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The characterisation is based on a generalisation of the hi- 
erarchical property for non-repeating conjunctive queries [21]. 
A query Q is non-repeating if every base relation occurs at 
most once in Q; in this section we assume all queries to be 
non-repeating. Given a query Q — ■ka (j 4>{Qi x ••• x Q n ) 
and an attribute A, we denote by A* the set of attributes 
in Q that are transitively equated with A in <j>; at(A*) de- 
notes the subset of the relation symbols Qi, - • ■ , Qn in which 
an attribute from A* appears. A non-repeating query Q — 
n A a <l> [Qi x • • • x Q n ) is hierarchical if for each two of its vari- 
ables A, B that are not in A and are not equated with a con- 
stant, it holds that at(A*) n at{B*) = or at{A*) D at(B*) 
or at(A*) C at(B*). Given a query of the above form, an 
attribute A is a rooi attribute if every relation Qi, . . . , Q„ 
contains some attribute from A* . 

Definitions 8 and 9 rely on the following notions and as- 
sumptions. Every relation that does not contain semimod- 
ule expressions and whose tuples are annotated with dis- 
tinct variable symbols with given probability distributions 
is called tuple-independent. We allow MIN and MAX aggre- 
gation as well as bounded SUM and COUNT aggregation, 
see Proposition 3. Additionally, we assume that for every 
aggregation monoid M occurring in a query, the constant 
Om does not occur in the database. 

We assume that in a selection a^, is a conjunction of 
(1) equality predicates of the form A=B or A=c where A 
and B are non-aggregation attributes and c is a constant, 
and (2) ^-comparisons a9/3, a9c, or a6A where a and f3 are 
aggregation attributes and A is a non- aggregation attribute. 

We give separate recursive definitions of the classes Q lnd 
of queries whose result tuples are pairwise independent, and 
Q le whose result tuples may be correlated. 

Definition 8 (Class Q md ). 

1. Every tuple-independent relation is a Q' nd -query. 

2. LetQ 1 ,...,Q n G Q lnd , andQ, = iz>A ini <-AGG i (.c i )(Qi)- 
Then the following are Q mi -queries: 

(a) Q = iTACtpiQi), such that 71 ^ A 

(b) Q = 7r A a 4>(Qi x • ■ • x Qn), such that Q is hierar- 
chical and all attributes in A are root variables 

(c) Q = 7T0CT 71 e 72 (Qi x Q 2 ), such that Ai — A 2 = 0. 

Concerning the queries under 8.2, first observe that the 
tuples in the result of Qi are pairwise independent and that 
the expressions 7$ have the form ji = X^agg Xi ® Vi wnere 
the independent random variables; the annotation of 

each tuple in Qi is $ = Ik in case of Ai = and a sum $ = 
Xj over the annotations of the tuples participating to this 
group, otherwise. Consider the case 8.2(a): The selection 
may compare 71 with a constant c to yield the annotation 
$ 4 = $ . [71 0c]. Under the assumption that the constant Om 
is not in the database, it can be shown that for any valuation 
v £ O it holds that = Os if and only if ^(71) = Om, 

i.e. the correlation of the distributions of $ and 71 is in- 
dependent of the data. Starting from this observation, it 
follows that the probability distribution of <E> t = <E> • [71 0c] 
can by expressed by considering the probabilities of $ and 
[71 8c] separately: For instance, for MIN aggregation and 
is <: P* t [0 s ] = P [71 < c j[0s] and P* t [l s ] = P[ 7 < c ][ls]- For 
MIN aggregation and is >: P* t [0 S ] = P*[0 S ] + P[ 71 > c ] [0 S ] 
and P$ t [ls] — A>[ls] + P[ 7 > c ][ls] — 1- Similar results are 
obtained for MAX and SUM aggregation. Since the result- 
ing tuples are again independent, the final projection tta 



creates as annotations sums of independent expressions; fur- 
thermore, as 71 £ A, the only expressions in result tuples 
are the annotation expressions. 

For 8.2(b), it is known that non-repeating hierarchical 
queries are tractable on tuple-independent relations [21]; 
since in addition all variables in the projection list are root 
variables, the result is tuple-independent. In 8.2(c), the con- 
ditional expression obtained for the result tuple compares 
two tractable and independent semimodule expressions. 

Secondly, we define Q hle as follows. 

Definition 9 (Class Q hle ). 
The following are Q h,e -queries: 

1. Q = irAiz'A; 1 <-AGG(C) [o"v<(Qi x • ■ • X Qn)] 
ifQu...,Q n €Q ind , 

and TYA^^iQi x ■ ■ • x Q n ) is hierarchical 

2. Q = ir A a 4> {Qi x • • • x Q„) ifQi,...,Q„€ Q md , and 
Q is hierarchical. 

Queries in 9.2 are the well-known non-repeating hierarchi- 
cal queries [21]. For queries in 9.1, first consider the query 
Q' = tta<74>(Qi x • • • x Q n ) which is by assumption hierar- 
chical. It follows that, given a tuple t £ Q' , its annotation 
$ = ~}2 4>i ls a read-once expression [17, 21]; moreover, "I> 
can be compiled into a d-tree whose size is bounded by the 
number of its variables. By Propositions 2 and 3, computing 
the probability distribution of such a d-tree - and thus the 
probability distribution of $ - can be done in time polyno- 
mial in the size of $. By Proposition 1, the size of the input 
annotation is polynomial in the size of the input database, 
hence query evaluation is in polynomial-time. 

Now consider the query Q (9.1). Aggregation yields ex- 
pressions of the form a = X^agg ^ ® Vi w here - as above - 
each <pi is a product of variables; since the query is non- 
repeating, each product has exactly n variables, one from 
each relation. The aggregated data values Vi are all from the 
same relation and can hence be associated with the variables 
from that relation. The expression a can be compiled into 
the same d-tree as <&, except that instead of the leaves for 
the variables x from the aggregated relation, it has leaves of 
the form x ® v. The following example illustrates this idea. 

Example 14. Consider the database in Figure 1 and the 
query Q = n70 ;a <-suM( P rice) (<7sho P ='M&s' (S) X PS) which is 
of type 1 in Q hle . In this example, the query Q' considered in 
the explanation above is Q' = 7T0cr s hop='M&s' (S) IX PS; the 
annotation of its result tuple is xiyn+x-i_yi2+X2y2i+X2y22 + 
2?32/33+:E32/34 which is equivalent to the read-once expression 
£1(2/11 + 2/12) + £2(2/21 + 2/22) + £3(2/33 + 2/34). According to 
the rewriting rules [•] in Figure 4, the annotation of the 
result of Q is (£11/11) ® 10 + (£12/12) ® 50 + (£22/21) ® 11 + 
(£22/22)®60+(£32/33)®15+(£32/34)®40. Associating the price 
values with the annotation variables yij of their relation, one 
obtains a read-once expression equivalent to the one above, 
except that the leaves with y^ variables are now semimodule 
expressions ® £1(2/11 ® 10 + 2/12 ® 50) + £2(2/21 ® 11 + 
2/22® 60) + £3(2/33® 15 + 2/34® 40). □ 

Re et al. consider queries ro0; 7 <-AGG(C) ,J <*(^i x • • • x R n ) 
in which TV0a^,(Ri x • • • x R n ) is hierarchical [19]; these are 
subsumed by Q hlc . For such queries involving aggregation 
without grouping, the neutral element of the aggregation 
monoid may safely be in the database without jeopardising 
query tractability, because the annotation the query result 
is always Ik and hence the correlation of the distributions of 
Ik and 7 is trivial, see also the discussion after Definition 8. 
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(c) COUNT, #runs = 10 (d) SUM, #runs = 10 

Figure 7: Experiment A: Varying the constant c for 
different aggregation monoids and comparison oper- 
ators 8. #v=25, L=200, R=0, #d=3, #1=3, maxv=200. 

7. EXPERIMENTAL EVALUATION 

We have assessed our query evaluation technique in two 
scenarios: randomly generated expressions and aggregate 
queries on TPC-H data. The results give positive evidence 
for the feasibility of query evaluation using our technique. 

The experiments were performed on a virtual machine 
with Intel Xeon X5650 Quad 2.67GHz/4GB/64bit running 
Linux 2.6.35-25/gcc4.4.5. Our algorithms for event con- 
struction and probability computation are implemented in 
C. They are integrated in the SPROUT query engine, which 
is an extension of PostgreSQL8.3.3. We run each experi- 
ment multiple times and report average wall-clock execution 
times and estimated standard deviation (vertical axis in all 
figures) while neglecting the slowest and fastest runs. 

7.1 Experiments on Synthetic Data 

We conducted an analysis of the qualitative behaviour of 
our technique for randomly generated semiring expressions 
over Boolean random variables of the two forms 



' L 
AGGL 



' Vi 



R 

AGGR 



£ $i ® Vi 8 c 



.AGGL 



(11) 



with the following parameters: L (R) is the number of semi- 
module terms on the left (right) of the comparison operator 
8, AGGL and AGGR are the aggregation monoids on each 
side. The second form corresponds to R=0. Values Vi,u>j 
are from [0, maxv] and the expression contains #v distinct 
variables. Each <&j has #cl clauses and each of them has 
#1 positive literals. In each experiment, we randomly gen- 
erate #runs different expressions of form Eq.(ll) according 
to the parameters specified in the respective figures. 
Experiment A explores the effect of constant c on the eval- 
uation of an expression with L=200 terms on the left side 
compared with c on the right side. Figure 7 depicts the 
run time for different aggregation and comparison opera- 
tors. Let us analyse MIN. Values Vi are drawn from [0, 200]. 
For small c, our pruning techniques ensure that only terms 
with values smaller than c are considered by convolution op- 
erators in the d-tree. For the operator <, we thus need to 
compute the probability that any of these terms is present 
(i.e., its semiring expression evaluates to true); for the op- 
erator >, we compute the probability that none of these 



terms is present; for the operator =, we compute the proba- 
bility that none of these terms is present and at least a term 
with monoid value c is present. The computation becomes 
slower when increasing c, and after c = 200, all terms need 
to be considered, hence the run time converges to a constant 
value. The behaviour of MAX mirrors that of MIN. 

Evaluating COUNT for a constant c effectively amounts to 
computing the probability that (at most, at least, or exactly) 
( ) terms are present. The binomial coefficient is trivial for 
c = 1 and c = L = 200 and bell-shaped in between. Since 
we consider three clauses per term (#cl = 3), this experi- 
ment evaluates COUNT DISTINCT on top of a conjunctive 
query. The case of SUM is equivalent to COUNT with the 
horizontal axis scaled by factor maxv/2 = 100. This is to be 
expected in the limit #runs — > oo since the aggregation val- 
ues are drawn uniformly from [0, maxv] with expected value 
maxv/2. We also considered Experiment A with L — 100 
terms and the other parameters unaltered. The qualitative 
behaviour remained unchanged and the run time halved. 
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(a) C: L=90,J?=0, #d=2, 
#1=2, maxv=5, c=3, 8 is =, 
#runs=40, AGGL=MIN. 
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Number of terms L 

(b) B: #n=25, R=0, 
#cl=3, #1=3, maxv=200, 
c=100, #rwis=10, 8 is =. 



Figure 8: Experiments C and B: Varying the num- 
ber of terms and variables. 

Experiment B (Figure 8b) investigates the effect of vary- 
ing the number of terms while keeping the number of vari- 
ables constant. The initial increase in run time is due to 
the cost of partitioning into mutex expressions, which even- 
tually saturates to linear growth for larger expressions once 
all variables have been considered for partitioning. This 
experiment mimics answering increasingly complex queries 
on a database of constant size; this produces increasingly 
larger expressions, yet with a constant number of variables. 
We repeated the above experiments (not in the figure) for 
the remaining comparison operators 8 £ {<, >} and for pa- 
rameters maxv = 5, c = 3 and obtained similar results. 
Experiment C mirrors Experiment B: Fix the size of the 
expression and vary the number of its distinct variables. It 
is known from #SAT analysis that such a setup exhibits 
an easy/hard/easy phase transition, see Figure 8a. For our 
algorithm, the transition is understood by its limiting cases: 
For few variables, expressions can quickly be decomposed 
into mutex expressions. Conversely, expressions separate 
into independent sub-expressions in case of many variables 
since it is likely that different clauses are independent. The 
large standard deviation in the hard regime suggests that 
the run time is sensitive to the particular distribution of 
variables within the expression. In case 8 is > or <, the 
runtime improves but follows a similar pattern. 
Experiment D visualises a phase transition similar to that 
of Experiment C (Figure 9). We explain for Figure 9a: 
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(a) TPC-H Query Ql (b) TPC-H Query Q2 
Figure 11: Experiment F: Queries on TPC-H Data. 
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Figure 9: Experiment D: Varying the number of 
literals per clause (a) and of clauses per term (b). 

#v=25, L=100, R=0, maxv=5, c=3, #runs=20, 6 is <. 
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(a) i?=150, L variable (b) L=150, R variable 

Figure 10: Experiment E: Expressions with differ- 
ent left /right aggregations. #w=25, #d=2, #£=2, 
maxv=200, c=100, #runs=10, 9 is <. 

When keeping the number of variables and terms constant 
while increasing the arity of clauses, the problem becomes 
easy for small and large clauses, and hard in between. 
Experiment E investigates the behaviour of expressions 
with different aggregation operators on each side (Figure 
10). We analyse the case J^max — 5^sum- When increasing 
the number of terms on the left/MAX side (left figure), it 
becomes more likely that the maximum value on the left 
side is larger than the sum on the right side; hence more 
terms have to be compiled. When increasing the number of 
terms on the right/SUM side (right figure), already a few 
mutex decomposition steps satisfy enough clauses to make 
the sum larger than the maximum on the left side, i.e. the 
compilation is faster with increasing number of SUM terms. 

7.2 Queries on TPC-H 

We consider tuple-independent TPC-H 2.14.0 databases 
of scales up to 1GB, and two TPC-H queries. The query 
Qi reports the amount of business that was billed, shipped, 
and returned (only the COUNT aggregate is selected). The 
query Q2 is a join of five relations and with a nested ag- 
gregate query, which asks for suppliers with minimum cost 
for an order for a given part in a given region. For each 



query, we compared the execution times (1) on a determin- 
istic database (Q°) without expression or probability com- 
putation, (2) of the computation of the expressions (J-]]), and 
(3) of probability computation for the result tuples (P(-)). 
Experiment F compares the runtime for Q°, [■], and P(-) 
(Figure 11). The overhead of expression computation and 
probability computation is polynomial, because the TPC-H 
dataset scales up the amount of tuples while keeping tuple 
correlations (i.e. the number of tuples directly related via 
joins within a group) constant. The performance difference 
between Q\ and Q2 is due to the selectivity of the queries: 
The size of the annotation expressions as a measure for the 
number of tuples contributing to Qi's answer is 1200 times 
larger than the size of Q2's annotations. 
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